Conversation
Extend the platform to support model-level competitions where users submit vLLM forks as tarballs. The system pip installs the fork, starts a vLLM server, runs serving benchmarks, and checks perplexity against a baseline.

- Add Language.Model and RankCriterion.CUSTOM to support model tasks
- Add ModelTaskData with benchmark shapes, perplexity config, timeouts
- Add run_model_benchmark() with 5-phase pipeline (install, server, perplexity, benchmark, cleanup)
- Add score_ascending field for higher-is-better ranking (throughput vs time)
- Add tarball upload support (50MB limit) in API
- Add Modal image with vLLM deps, sccache, and model weights volume
- Add download_model.py for pre-populating model weights
- Add example task definition for Llama-3.1-8B serving
- Add reuse documentation listing unchanged components
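For orientation, a minimal sketch of what a `ModelTaskData` config along these lines could look like; the field names and defaults here are assumptions drawn from this description, not copied from `src/libkernelbot/task.py`:

```python
from dataclasses import dataclass, field


# Hypothetical sketch only -- the real dataclass in src/libkernelbot/task.py may differ.
@dataclass
class ModelTaskData:
    model_name: str                                              # e.g. "meta-llama/Llama-3.1-8B"
    tensor_parallel: int = 1                                     # GPUs used by the vLLM server
    benchmark_shapes: list[dict] = field(default_factory=list)   # num_prompts / input_len / output_len
    perplexity_baseline: float = 1.80                            # measured with stock vLLM
    perplexity_tolerance: float = 0.02                           # pass if within 2% of baseline
    install_timeout: int = 1800                                  # seconds for installing the fork
    server_timeout: int = 600                                    # seconds to wait for server startup
    benchmark_timeout: int = 1800                                # seconds per benchmark shape
    ranking_metric: str = "request_throughput"                   # used with RankCriterion.CUSTOM
```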
Pull request overview
Adds end-to-end “model competition” support where users submit vLLM forks as archives that are installed and benchmarked via a new runner path, with leaderboard ranking able to support both lower-is-better and higher-is-better scores.
Changes:
- Introduces Language.Model + ModelTaskData, plus a run_model_benchmark() pipeline (install → serve → perplexity → benchmark → cleanup).
- Adds score direction (score_ascending) wiring through task config, DB ranking queries, and API responses.
- Extends submission handling to accept binary archives (50MB) and adds Modal infra (new image + volumes) and a weight pre-download script.
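As an illustration of what the score-direction wiring amounts to (this is not the repository's actual query code), the flag simply flips the sort order used when ranking a leaderboard:

```python
# Illustration only: score_ascending=True means lower scores rank first (e.g. runtime),
# while False means higher scores rank first (e.g. request throughput).
def ranking_order_clause(score_ascending: bool) -> str:
    return "ORDER BY score ASC" if score_ascending else "ORDER BY score DESC"


assert ranking_order_clause(False) == "ORDER BY score DESC"  # throughput-style leaderboards
```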
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| tests/test_task.py | Updates expected task config dicts to include score_ascending. |
| src/runners/modal_runner_archs.py | Registers Modal functions for model benchmarking on selected GPUs with volumes mounted. |
| src/runners/modal_runner.py | Adds dedicated model_image and Modal Volumes for model weights + sccache. |
| src/runners/download_model.py | Adds a Modal app to pre-download HF model weights into a shared volume. |
| src/libkernelbot/task.py | Adds ModelTaskData, extends LeaderboardTask to support model tasks + score_ascending. |
| src/libkernelbot/submission.py | Adds custom metric scoring, and threads score_ascending into competition/ranking display. |
| src/libkernelbot/run_eval.py | Routes lang=model to new run_model_benchmark() implementation. |
| src/libkernelbot/leaderboard_db.py | Stores bytes submissions and adds ranking direction support to leaderboard queries. |
| src/libkernelbot/launchers/modal.py | Dispatches Modal function name based on lang including model. |
| src/libkernelbot/consts.py | Adds Language.Model and RankCriterion.CUSTOM. |
| src/libkernelbot/backend.py | Base64-encodes model archives for transport and avoids .lower() on bytes. |
| src/kernelbot/api/main.py | Ensures /submissions endpoint uses correct score ordering for the given leaderboard. |
| src/kernelbot/api/api_utils.py | Accepts larger binary uploads for model tasks (50MB) and validates archive extension. |
| examples/llama_8b_serving/task.yml | Adds an example model task configuration (custom ranking metric + descending score). |
| docs/model-competitions-reuse.md | Documents which existing components are reused unchanged for model competitions. |
Comments suppressed due to low confidence (1)
src/runners/modal_runner.py:1

- These pins look risky: I’m not aware of a torch==2.9.1 release or a cu130 wheel index in the standard PyTorch distribution scheme. If this is intentional for your environment, consider documenting/validating it; otherwise, pin to a known-available Torch/CUDA combo (or make it configurable) to avoid Modal image build failures.

```python
import signal
```
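If the pins do need to change, a build that stays on a known-published wheel index might look like the following sketch; the version and CUDA tags here are illustrative placeholders, not a recommendation for this repo:

```python
import modal

# Placeholder pins -- check https://download.pytorch.org/whl for combinations that
# actually exist before using them in model_image.
TORCH_VERSION = "2.5.1"
CUDA_TAG = "cu124"

model_image = (
    modal.Image.debian_slim(python_version="3.11")
    .run_commands(
        # Installing from the matching PyTorch wheel index fails fast at image build
        # time if the version/CUDA combination does not exist.
        f"pip install torch=={TORCH_VERSION} "
        f"--index-url https://download.pytorch.org/whl/{CUDA_TAG}"
    )
)
```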
src/libkernelbot/run_eval.py
Outdated
```python
if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"
```
tar.extractall() / ZipFile.extractall() are vulnerable to path traversal (e.g., ../../...) and can write outside extract_dir. Use a safe extraction routine that validates each member path stays within extract_dir (reject absolute paths and .. segments) before extracting.
Suggested change:

```python
def _safe_extract_tar(tar: tarfile.TarFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for member in tar.getmembers():
        name = member.name
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in tar archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in tar archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Tar path escapes destination directory: {name!r}")
    tar.extractall(path=dest_dir)


def _safe_extract_zip(zf: zipfile.ZipFile, dest_dir: str) -> None:
    base_dir = os.path.abspath(dest_dir)
    for name in zf.namelist():
        # Disallow absolute paths
        if os.path.isabs(name):
            raise ValueError(f"Unsafe absolute path in zip archive: {name!r}")
        # Disallow parent directory traversal
        if ".." in Path(name).parts:
            raise ValueError(f"Unsafe relative path in zip archive: {name!r}")
        target_path = os.path.abspath(os.path.join(base_dir, name))
        # Ensure the target path is within dest_dir
        if os.path.commonpath([base_dir, target_path]) != base_dir:
            raise ValueError(f"Zip path escapes destination directory: {name!r}")
    zf.extractall(path=dest_dir)


try:
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            _safe_extract_tar(tar, extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            _safe_extract_zip(zf, extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"
except ValueError as e:
    return False, "", f"Submission archive contains unsafe paths: {e}"
```
```python
work_dir = tempfile.mkdtemp(prefix="model_submission_")
archive_path = os.path.join(work_dir, "submission.tar.gz")

with open(archive_path, "wb") as f:
    f.write(archive_bytes)

# Extract
import tarfile
import zipfile

extract_dir = os.path.join(work_dir, "src")
os.makedirs(extract_dir, exist_ok=True)

if tarfile.is_tarfile(archive_path):
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(path=extract_dir)
elif zipfile.is_zipfile(archive_path):
    with zipfile.ZipFile(archive_path, "r") as zf:
        zf.extractall(path=extract_dir)
else:
    return False, "", "Submission archive is not a valid tar.gz or zip file"

# Find the actual package directory (may be nested one level)
entries = os.listdir(extract_dir)
if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
    pkg_dir = os.path.join(extract_dir, entries[0])
else:
    pkg_dir = extract_dir

# pip install
result = subprocess.run(
    ["pip", "install", "-e", pkg_dir],
    capture_output=True,
    text=True,
    timeout=install_timeout,
)

return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```
tempfile.mkdtemp() creates a work directory that is never removed, which can leak disk space across runs. Prefer TemporaryDirectory() or explicitly shutil.rmtree(work_dir) in a finally (including the error/early-return paths).
Suggested change:

```python
with tempfile.TemporaryDirectory(prefix="model_submission_") as work_dir:
    archive_path = os.path.join(work_dir, "submission.tar.gz")

    with open(archive_path, "wb") as f:
        f.write(archive_bytes)

    # Extract
    import tarfile
    import zipfile

    extract_dir = os.path.join(work_dir, "src")
    os.makedirs(extract_dir, exist_ok=True)

    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar:
            tar.extractall(path=extract_dir)
    elif zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=extract_dir)
    else:
        return False, "", "Submission archive is not a valid tar.gz or zip file"

    # Find the actual package directory (may be nested one level)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        pkg_dir = os.path.join(extract_dir, entries[0])
    else:
        pkg_dir = extract_dir

    # pip install
    result = subprocess.run(
        ["pip", "install", "-e", pkg_dir],
        capture_output=True,
        text=True,
        timeout=install_timeout,
    )

    return result.returncode == 0, _limit_length(result.stdout), _limit_length(result.stderr)
```
src/libkernelbot/run_eval.py
Outdated
```python
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
```
Starting the server with stdout=PIPE and stderr=PIPE without continuously draining them risks blocking the vLLM process once its output buffers fill, potentially hanging runs. Redirect to files/DEVNULL, merge streams, or spawn reader threads to drain and store logs safely.
Suggested change:

```python
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
```
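An alternative that keeps the logs (roughly what a later commit in this PR describes: capture server output to a log file and include it in the result on startup failure) could look like this sketch; the paths and command are placeholders:

```python
import os
import subprocess
import sys

server_log_path = "vllm_server.log"            # placeholder path
cmd = [sys.executable, "-m", "http.server"]    # placeholder standing in for the vLLM server command
env = os.environ.copy()

# Writing to a file avoids the pipe-buffer deadlock while still preserving the
# server output for debugging; the child keeps its own copy of the file descriptor.
with open(server_log_path, "w") as server_log:
    server_proc = subprocess.Popen(
        cmd,
        stdout=server_log,
        stderr=subprocess.STDOUT,  # merge both streams into one log file
        env=env,
    )

# ... after shutdown or on startup failure, read the log back and attach it to the result:
server_proc.terminate()
with open(server_log_path) as f:
    server_output = f.read()
```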
src/libkernelbot/run_eval.py
Outdated
```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]

# Prefer the benchmark_serving script approach
cmd = [
    "python3", "-m", "vllm.benchmarks.benchmark_serving",
    "--backend", "openai-chat",
    "--base-url", f"http://localhost:{port}",
    "--model", model_name,
    "--endpoint", "/v1/chat/completions",
    "--num-prompts", str(shape.get("num_prompts", 100)),
    "--random-input-len", str(shape.get("input_len", 512)),
    "--random-output-len", str(shape.get("output_len", 128)),
    "--save-result",
]

result = subprocess.run(
    cmd,
    capture_output=True,
    text=True,
    timeout=benchmark_timeout,
)

if result.returncode != 0:
    all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
    continue

# Parse the saved JSON result file
# vLLM saves to a json file in current directory
import glob
json_files = sorted(glob.glob("*.json"), key=os.path.getmtime, reverse=True)
if json_files:
    try:
        with open(json_files[0]) as f:
            bench_result = json.load(f)
        for key in [
            "request_throughput",
            "output_throughput",
            "mean_ttft_ms",
            "median_ttft_ms",
            "p99_ttft_ms",
            "mean_tpot_ms",
            "median_tpot_ms",
            "p99_tpot_ms",
            "mean_itl_ms",
            "median_itl_ms",
            "p99_itl_ms",
        ]:
            if key in bench_result:
                all_metrics[key] = bench_result[key]
        os.remove(json_files[0])
    except (json.JSONDecodeError, OSError):
        pass

all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```
Metrics are overwritten across shapes because all_metrics[key] is reused for every shape; only the last shape’s values will survive. Also, glob('*.json') in the current working directory can pick up unrelated files and is race-prone. Write results to a per-shape, known filepath (or run in a temp working directory) and namespace metrics per shape (e.g., shape_{i}_{key}) or return a list keyed by shape.
Suggested change:

```python
with tempfile.TemporaryDirectory() as tmpdir:
    cmd = [
        "python3", "-m", "vllm.entrypoints.openai.run_batch",
    ]

    # Prefer the benchmark_serving script approach
    cmd = [
        "python3", "-m", "vllm.benchmarks.benchmark_serving",
        "--backend", "openai-chat",
        "--base-url", f"http://localhost:{port}",
        "--model", model_name,
        "--endpoint", "/v1/chat/completions",
        "--num-prompts", str(shape.get("num_prompts", 100)),
        "--random-input-len", str(shape.get("input_len", 512)),
        "--random-output-len", str(shape.get("output_len", 128)),
        "--save-result",
    ]

    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=benchmark_timeout,
        cwd=tmpdir,
    )

    if result.returncode != 0:
        all_metrics[f"shape_{i}_error"] = _limit_length(result.stderr)
        all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
        continue

    # Parse the saved JSON result file
    # vLLM saves to a json file in the working directory
    import glob
    json_files = sorted(
        glob.glob(os.path.join(tmpdir, "*.json")),
        key=os.path.getmtime,
        reverse=True,
    )
    if json_files:
        try:
            with open(json_files[0]) as f:
                bench_result = json.load(f)
            for key in [
                "request_throughput",
                "output_throughput",
                "mean_ttft_ms",
                "median_ttft_ms",
                "p99_ttft_ms",
                "mean_tpot_ms",
                "median_tpot_ms",
                "p99_tpot_ms",
                "mean_itl_ms",
                "median_itl_ms",
                "p99_itl_ms",
            ]:
                if key in bench_result:
                    all_metrics[f"shape_{i}_{key}"] = bench_result[key]
            os.remove(json_files[0])
        except (json.JSONDecodeError, OSError):
            pass

    all_metrics[f"shape_{i}_stdout"] = _limit_length(result.stdout)
```
src/libkernelbot/run_eval.py
Outdated
```python
cmd = [
    "python3", "-m", "vllm.entrypoints.openai.run_batch",
]
```
The initial cmd assignment to vllm.entrypoints.openai.run_batch is immediately overwritten and has no effect. Remove the dead assignment to reduce confusion and keep the benchmark invocation single-sourced.
Suggested change: delete the dead assignment shown above.
```python
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
```
The perplexity check silently ignores all request/parse errors and may compute perplexity from a small subset of prompts, which can lead to unstable or falsely passing results. Consider failing the check on any request error (or at least tracking an error count and requiring a minimum success ratio) and include the error details in the run result for debuggability.
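A sketch of the hardened version (a later fix commit in this PR settles on requiring more than 50% of prompts to succeed); the request body and response fields below illustrate an OpenAI-style /v1/completions scoring call and may not match the exact fields this code uses:

```python
import json
import math
import urllib.request


def perplexity_with_error_budget(endpoint: str, prompts: list[str],
                                 min_success_ratio: float = 0.5) -> float:
    """Compute perplexity, but fail loudly if too many requests error out."""
    total_log_prob, total_tokens, successes, errors = 0.0, 0, 0, []
    for prompt in prompts:
        body = json.dumps(
            {"prompt": prompt, "max_tokens": 0, "echo": True, "logprobs": 0}
        ).encode()
        req = urllib.request.Request(
            endpoint, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                data = json.loads(resp.read())
            token_logprobs = data["choices"][0]["logprobs"]["token_logprobs"]
            total_log_prob += sum(lp for lp in token_logprobs if lp is not None)
            total_tokens += sum(1 for lp in token_logprobs if lp is not None)
            successes += 1
        except Exception as exc:  # collect errors instead of silently continuing
            errors.append(f"{type(exc).__name__}: {exc}")

    if total_tokens == 0 or successes / len(prompts) < min_success_ratio:
        raise RuntimeError(
            f"perplexity check unreliable: {successes}/{len(prompts)} prompts succeeded; "
            f"first error: {errors[0] if errors else 'n/a'}"
        )
    return math.exp(-total_log_prob / total_tokens)
```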
```python
def compute_score(result: FullResult, task: LeaderboardTask, submission_id: int) -> float:
    if task.ranking_by == RankCriterion.CUSTOM:
        ranking_metric = task.config.ranking_metric
```
RankCriterion.CUSTOM implicitly assumes task.config has ranking_metric, but LeaderboardTask.config can also be CudaTaskData/PythonTaskData, which don’t define it. Enforce CUSTOM only for Language.Model (e.g., in LeaderboardTask.__post_init__) or store ranking_metric at the task level so this doesn’t depend on a specific config dataclass.
Suggested change:

```python
# Some task configurations (e.g., CudaTaskData/PythonTaskData) may not
# define a `ranking_metric` attribute. Guard against that here so we
# don't rely on a specific config dataclass shape.
config = getattr(task, "config", None)
if config is None or not hasattr(config, "ranking_metric"):
    raise KernelBotError(
        "RankCriterion.CUSTOM requires task.config to define a 'ranking_metric' "
        f"attribute; got config type '{type(config).__name__}' instead."
    )
ranking_metric = getattr(config, "ranking_metric")
```
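A sketch of the other option mentioned in the comment, validating at task-construction time; the import path and dataclass layout are assumptions rather than the actual task.py code:

```python
from dataclasses import dataclass

from libkernelbot.consts import Language, RankCriterion  # assumed import path


@dataclass
class LeaderboardTask:  # trimmed-down sketch, not the real class
    lang: Language
    ranking_by: RankCriterion
    config: object

    def __post_init__(self):
        # Reject CUSTOM ranking unless the config actually carries a ranking_metric
        # (i.e. a Language.Model task with ModelTaskData).
        if self.ranking_by == RankCriterion.CUSTOM and not hasattr(self.config, "ranking_metric"):
            raise ValueError(
                "RankCriterion.CUSTOM requires a ModelTaskData config that defines "
                f"'ranking_metric'; got {type(self.config).__name__}"
            )
```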
```python
    return passed, measured_ppl


def run_model_benchmark(config: dict) -> FullResult:  # noqa: C901
```
The new run_model_benchmark() path (install, server startup/timeout handling, perplexity pass/fail, benchmark parsing, and cleanup) introduces substantial logic but isn’t covered by unit tests. Since the repo already has pytest coverage (e.g., tests/test_task.py), add focused tests that mock subprocess.run / subprocess.Popen and urllib.request.urlopen to deterministically validate success and failure modes.
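For instance, a test along these lines could pin down the install-failure path without a GPU or network; the import path, config keys, and FullResult fields are assumptions about this PR's code, so a real test would need to match the actual signatures:

```python
from unittest import mock

from libkernelbot import run_eval  # assumed import path


def test_run_model_benchmark_reports_install_failure():
    # Pretend `pip install` fails; everything after the install phase should short-circuit.
    fake_pip = mock.Mock(returncode=1, stdout="", stderr="pip install failed")
    config = {
        "archive": b"not-a-real-tarball",        # a tiny tar.gz fixture in a real test
        "model_config": {"install_timeout": 10},  # key names assumed
    }
    with mock.patch.object(run_eval.subprocess, "run", return_value=fake_pip):
        result = run_eval.run_model_benchmark(config)
    # Exact FullResult fields are assumed here.
    assert result.success is False
    assert "pip install failed" in str(result)
```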
- Fix path traversal vulnerability in tar/zip extraction (validate members)
- Fix metrics overwritten across shapes (namespace by shape index)
- Fix vLLM server stdout/stderr PIPE blocking (redirect to DEVNULL)
- Fix perplexity check silently swallowing errors (require >50% success)
- Remove dead cmd assignment in benchmark runner
- Add hasattr guard for CUSTOM ranking_metric in compute_score
- Remove docs/model-competitions-reuse.md
- Fix lang_name KeyError crash for model submissions in GitHub launcher
- Upload model archives as Git blobs to bypass workflow dispatch size limits
- Add nvidia_model_workflow.yml with 60-min timeout for model benchmarking
- Update github-runner.py to download blob archives before running
- Add model-specific timeout computation from model_config
- Add expected run name pattern for model workflow dispatch
- Block model competitions on AMD GPUs (NVIDIA only for now)
Isolates model benchmark dependencies in a venv instead of polluting the runner's system Python. Falls back to pip if uv is not available.
- Persistent venv at /opt/model-venv with torch + vLLM deps pre-cached (mirrors Modal model_image pattern: install vllm for deps, uninstall)
- Set SETUPTOOLS_SCM_PRETEND_VERSION for tarball submissions without .git
- Pin Python 3.10 in venv, add sccache for CUDA compilation caching
Drop /opt persistent venv (permission issues on containerized runners). Bootstrap fresh venv each run with torch + vllm deps. Optimize later.
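A rough sketch of that bootstrap, preferring uv when it is on PATH and falling back to pip; paths and the package list are illustrative:

```python
import shutil
import subprocess
import sys


def bootstrap_venv(venv_dir: str, packages: list[str]) -> str:
    """Create a throwaway venv for one run and return its interpreter path."""
    subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
    venv_python = f"{venv_dir}/bin/python"
    if shutil.which("uv"):
        # uv installs into the target interpreter's environment via --python
        subprocess.run(["uv", "pip", "install", "--python", venv_python, *packages], check=True)
    else:
        subprocess.run([venv_python, "-m", "pip", "install", *packages], check=True)
    # Use this interpreter (not "python3") for the vLLM server/benchmark subprocesses.
    return venv_python
```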
- Only use --download-dir /models if the path exists (Modal volume). On GitHub runners, fall back to the HF cache default.
- Capture server stdout/stderr to a log file instead of DEVNULL.
- Include the server log in the result on startup failure for debugging.
Calibrated from actual B200 E2E test run with stock vLLM.
…afety

- Switch model_image from CUDA 13.1 to CUDA 12.8 base so the vLLM wheel (compiled for CUDA 12) works natively without compat libraries or source builds. CUDA 12.8 supports H100 (SM 9.0) and B200 (SM 10.0).
- Use the vllm bench serve CLI (python3 -m vllm.entrypoints.cli.main bench serve) instead of the deprecated benchmarks/benchmark_serving.py script. Use --backend openai with /v1/completions for base models.
- Add fast overlay path: for Python-only submissions, copy .py files directly onto the pre-installed vLLM package instead of doing a full pip install from source. Includes backup/restore safety to detect and recover if an overlay breaks vLLM imports.
- Add GPU cleanup (pkill + torch.cuda.empty_cache) before server start to handle Modal container reuse where previous vLLM processes left GPU memory allocated.
- Add HF secret to model benchmark functions for gated model access.
- Fix download_model.py to save weights at /models/<org>/<model> path matching what _resolve_model_ref() expects.

Tested E2E on Modal H100: perplexity 1.7785 (pass), benchmark 34.54 req/s with 100/100 successful requests.
- Pop gpus from the raw dict before LeaderboardTask.from_dict() to prevent an unexpected keyword argument error when creating dev leaderboards
- Handle binary model archives in get_submission_by_id() with errors="replace" to prevent UnicodeDecodeError on tar.gz data
- Add .claude/skills/model-competition-testing.md with full E2E testing instructions, correctness criteria, and troubleshooting
…guide

Adds Step 5b covering the streaming SSE endpoint flow via popcorn-cli, including config backup, build, submit with --no-tui, and config restore.
- Switch GH workflow torch from cu130 to cu128 (vLLM pip wheel needs libcudart.so.12)
- Keep vLLM installed instead of uninstalling — enables the fast overlay path for Python-only submissions (~instant vs ~20 min)
- Use sys.executable instead of "python3" for vLLM server and benchmark subprocesses so they use the venv Python
- Add CUDA_VISIBLE_DEVICES=4,5,6,7 to the workflow (GPUs 0-3 occupied)
- Add B200 machine entry to remote-gpu-testing skill
- Add GH Actions B200 section to model-competition-testing skill
Summary
End-to-end support for model competitions where users submit vLLM forks and are benchmarked on serving throughput/latency. This mirrors the existing kernel submission flow but for full model inference serving.
- Language.Model type with ModelTaskData config (model name, tensor parallel, benchmark shapes, perplexity baseline)
- run_model_benchmark() — 4-phase pipeline: extract archive → install fork (fast overlay or pip) → start vLLM server → perplexity check → benchmark serving
- GitHub Actions workflow (nvidia_model_workflow.yml) for B200 self-hosted runners
- score_ascending field for higher-is-better metrics (e.g., throughput)

E2E Testing Status
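A schematic of that pipeline follows; the helper signatures are invented for illustration, and the real logic lives in run_model_benchmark() in run_eval.py:

```python
import subprocess
from typing import Callable


def model_benchmark_pipeline(
    extract: Callable[[bytes], str],
    install: Callable[[str], None],
    start_server: Callable[[], subprocess.Popen],
    check_perplexity: Callable[[], tuple[bool, float]],
    run_benchmark: Callable[[], dict],
    archive: bytes,
) -> dict:
    work_dir = extract(archive)          # unpack the submitted vLLM fork
    install(work_dir)                    # fast overlay for .py-only forks, else pip install
    server = start_server()              # OpenAI-compatible vLLM server
    try:
        passed, ppl = check_perplexity()             # correctness gate
        if not passed:
            return {"passed": False, "perplexity": ppl}
        return {"passed": True, "perplexity": ppl, **run_benchmark()}  # ranking metrics
    finally:
        server.terminate()               # cleanup runs even if a phase raises
```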
Full API → Modal → DB pipeline (H100) — FULLY WORKING
Tested the complete round-trip: HTTP submission → API server → Background Manager → Modal dispatch → H100 runner → FullResult → score computation → DB storage → leaderboard ranking.
- POST /submission/llama_8b_serving-dev/H100/leaderboard returns 202
- run_model_benchmark_h100 found and invoked
- request_throughput = 42.10 extracted via RankCriterion.CUSTOM
- submission_job_status.status = 'succeeded'
- GET /user/submissions returns the submission with its score
- Leaderboard llama_8b_serving-dev shows the entry with score 42.10

Total pipeline time: ~3 minutes (warm container with cached weights).
Popcorn-CLI → SSE → Modal → DB (H100) — FULLY WORKING
Tested the CLI streaming flow: popcorn-cli submit → SSE endpoint → Background Manager → Modal dispatch → result callback → DB storage.
- --mode test and --mode leaderboard both exercised
- CLI uses the streaming SSE endpoint
(POST /{leaderboard}/{gpu}/{mode}) with --no-tui for non-interactive use.

GitHub Actions route (B200) — FULLY WORKING
Tested manually on B200 self-hosted runner (
l-bgx-01, 8x B200). All 4 phases pass:Key fixes for B200 route:
libcudart.so.12)sys.executableinstead of"python3"for subprocesses (venv python)CUDA_VISIBLE_DEVICES=4,5,6,7(GPUs 0-3 occupied)/models/meta-llama/Llama-3.1-8BBugs found and fixed during E2E
task.py—gpuskeyword error:gpusfield fromtask.ymlwas passed toLeaderboardTask.__init__()which doesn't accept it. Fixed by poppinggpusbeforefrom_dict().leaderboard_db.py— binary archive decode crash:get_submission_by_id()tried to UTF-8 decode binary tar.gz archives, causingUnicodeDecodeError. Fixed witherrors="replace"..envtokens pointed togpu-modeworkspace but deploy went tomsaroufimworkspace via profile. API server must use matching tokens.run_eval.py— subprocess used system python:_start_vllm_server()and benchmark used"python3"which resolved to/usr/bin/python3(no vLLM). Fixed withsys.executable.Key implementation fixes (from earlier iterations)
vllm bench serveCLI (python3 -m vllm.entrypoints.cli.main bench serve) instead of deprecatedbenchmark_serving.py--backend openaiwith/v1/completions(notopenai-chat) for base models like Llama-3.1-8Bpkill+torch.cuda.empty_cache()) before server start for container reusedownload_model.pyto save weights at/models/<org>/<model>matching_resolve_model_ref()How correctness is defined
Model submissions are validated through a two-phase gate defined in
task.yml:Phase 1: Perplexity check (correctness gate)
/v1/completionsendpointmeasured_ppl = exp(-total_log_prob / total_tokens)abs(measured - baseline) / baseline <= tolerance(within 2% of baseline 1.80)Phase 2: Serving benchmark (ranking metric)
vllm bench servewith specified shapes (1000 prompts, 512 input len, 128 output len)request_throughput(req/s) as the leaderboard scorescore_ascending: false,ranking_by: custom)The perplexity baseline (1.80) was established by running unmodified vLLM against Llama-3.1-8B on H100.
Remaining work
Perplexity / determinism
Performance — GitHub Actions route
Nice to have
- A smaller model (e.g. facebook/opt-125m) for CI smoke tests
- Unit tests (test_backend.py, test_task.py)
- Dev leaderboard created from the example task.yml with the gpus field
- sys.executable fix for venv subprocess resolution verified